leaderboard: per-dataset metric derivation, chip filters + metric toggle on every table by radinhamidi · Pull Request #29 · ls3-lab/QueryGym

radinhamidi · 2026-05-20T07:18:26Z

Summary

Comprehensive table-quality pass across every leaderboard page, driven by issues found in the latest review:

No more phantom metric columns. /datasets/[id] used to render whatever eval_metrics the dataset registry listed (MAP on TREC DL, recall_1000 on BEIR), regardless of whether those metrics actually appeared in the data. The per-dataset shard now derives its columns from the actual run rows, using the same approach as the home matrix.
Chip filters on every per-X page, not just the home page. /datasets/[id] gets Method/Model/Retriever/Metric; /methods/[id] gets Model/Retriever/Metric; /models/[id] gets Method/Retriever/Metric; /retrievers/[id] gets Method/Model/Metric. Behavior matches the home page (chip→qg-chip-hidden→qg-itable-reapply handshake).
Metric toggle everywhere. Per-method/model/retriever pages now expose both primary (nDCG@10) and secondary (R@1k or R@100) per dataset column, swapped via the Metric chip.
Pretty labels everywhere. Dataset short labels + METRIC_LABEL (ndcg_cut_10 → nDCG@10, recall_1000 → R@1k, recall_100 → R@100, map → MAP) on /datasets/[id] and the per-X pages too.
Drop the ugly inner scrollbar. Home + /datasets/[id] no longer set max-h-[70vh] overflow-y-auto. The page scrolls naturally; sticky top-0 thead sticks to the viewport.
/models index renders the display label (gpt-4.1) rather than the provider-prefixed id (openai/gpt-4.1), matching the /methods index convention.
/runs/[run_id] reproduce snippet rebuilt against the real example pipeline. Pyserini index names for non-lexical paradigms no longer carry a spurious .flat segment (.flat.splade-pp-ed / .flat.bge-base-en-v1.5 are now .splade-pp-ed / .bge-base-en-v1.5), and trec_eval references the qrels key from the dataset registry instead of the topics key.
/runs/[run_id] Method field shows the display name (Q2D (FS) etc.) not the raw method_id.
/about no longer claims that every row ships a .run.txt and queries.tsv, since both are optional under the current schema; the documented path also includes the {retriever} segment added in PR #20 (Schema: optional artifacts + DL-HARD dataset entry).
Replaces duplicate cell + chip-bar code across 5 pages with two shared components: MatrixCell.astro (link + primary/secondary spans + sort hooks) and FilterChips.astro (groups + metric special-case + reapply event).
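
For illustration, here is a minimal TypeScript sketch of deriving per-dataset metric columns from run rows rather than from the registry's eval_metrics list. It assumes each run row exposes a metrics record keyed by trec_eval measure names; RunRow, METRIC_ORDER, and deriveMetricColumns are hypothetical names, and only the label mapping mirrors the METRIC_LABEL table described above.

```typescript
// Hypothetical shape of one leaderboard run row; real field names may differ.
interface RunRow {
  method: string;
  model: string;
  retriever: string;
  metrics: Record<string, number | undefined>;
}

// Pretty labels, mirroring the METRIC_LABEL mapping described in this PR.
const METRIC_LABEL: Record<string, string> = {
  ndcg_cut_10: "nDCG@10",
  recall_1000: "R@1k",
  recall_100: "R@100",
  map: "MAP",
};

// Assumed column order: primary metric first, then recall variants, then MAP.
const METRIC_ORDER = ["ndcg_cut_10", "recall_1000", "recall_100", "map"];

interface MetricColumn {
  key: string;
  label: string;
}

// Keep only metrics that at least one run actually reports, so a metric that
// exists only in the dataset registry (e.g. MAP on TREC DL) never becomes a column.
function deriveMetricColumns(rows: RunRow[]): MetricColumn[] {
  const present = new Set<string>();
  for (const row of rows) {
    for (const [key, value] of Object.entries(row.metrics)) {
      if (typeof value === "number" && Number.isFinite(value)) present.add(key);
    }
  }
  // Unknown metrics sort after the known ones, alphabetically among themselves.
  const rank = (key: string): number => {
    const i = METRIC_ORDER.indexOf(key);
    return i === -1 ? METRIC_ORDER.length : i;
  };
  return Array.from(present)
    .sort((a, b) => rank(a) - rank(b) || a.localeCompare(b))
    .map((key) => ({ key, label: METRIC_LABEL[key] ?? key }));
}

// Placeholder rows (not real results): the runs report nDCG@10 and R@100 only.
const rows: RunRow[] = [
  { method: "q2d_fs", model: "gpt-4.1", retriever: "bm25", metrics: { ndcg_cut_10: 0.5, recall_100: 0.9 } },
  { method: "q2d_cot", model: "gpt-4.1", retriever: "bm25", metrics: { ndcg_cut_10: 0.4 } },
];
const columns = deriveMetricColumns(rows);
console.log(columns.map((c) => c.label).join(", ")); // prints "nDCG@10, R@100"
```

Because no placeholder run reports MAP or recall_1000, neither becomes a column, which is the phantom-column case this PR removes.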

Test plan

python -m pytest reproducibility/tests/ — 44/44 passing
pnpm --filter @qg/leaderboard build — clean (1113 pages built)
/datasets/beir-v1.0.0-scifact: single metric column with nDCG@10 + R@100 toggle, no recall_1000 phantom column
/datasets/msmarco-v1-passage.trecdl2019: no MAP phantom column
/models/ index card titles show display labels (gpt-4.1, Qwen2.5-72B-Instruct…)
/runs/* Method field shows Q2D (FS) / Q2D (COT) for query2doc variants
/runs/* reproduce snippet generates beir-v1.0.0-trec-covid.splade-pp-ed, not .flat.splade-pp-ed
Home page produces no max-h-[70vh] wrapper
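
The index-name check in the test plan above comes down to picking a suffix per retrieval paradigm and appending it to the dataset key with no extra .flat segment. This is a minimal TypeScript sketch, assuming the BEIR prebuilt-index naming pattern used by Pyserini; Paradigm, INDEX_SUFFIX, and pyseriniIndexName are hypothetical and not taken from the repository.

```typescript
// Hypothetical paradigm names; the real pipeline may label retrievers differently.
type Paradigm = "lexical" | "learned_sparse" | "dense";

// Assumed BEIR prebuilt-index suffixes. Only the lexical (BM25 flat) index
// carries ".flat"; the SPLADE++ ED and BGE indexes hang directly off the dataset key.
const INDEX_SUFFIX: Record<Paradigm, string> = {
  lexical: ".flat",
  learned_sparse: ".splade-pp-ed",
  dense: ".bge-base-en-v1.5",
};

function pyseriniIndexName(datasetKey: string, paradigm: Paradigm): string {
  return datasetKey + INDEX_SUFFIX[paradigm];
}

console.log(pyseriniIndexName("beir-v1.0.0-trec-covid", "learned_sparse"));
// prints "beir-v1.0.0-trec-covid.splade-pp-ed"
```

This only covers the BEIR naming pattern; other collections may follow a different scheme and would need their own lookup.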

🤖 Generated with Claude Code

…ric toggle on every table

- per-dataset shard reads metrics from actual runs (no MAP/recall_1000 phantom columns)
- shared FilterChips + MatrixCell components reused across home / dataset / method / model / retriever pages
- every per-X table gets chip filters (method/model/retriever/metric as applicable) + metric toggle
- pretty metric labels (nDCG@10, R@1k, R@100, MAP) everywhere
- drop double scrollbar on home + per-dataset tables
- /models index renders display label, not provider-prefixed id
- /runs page shows method display name; reproduce snippet aligned to example pipeline with correct Pyserini index names and qrels-based trec_eval
- /about page no longer claims run.txt/queries.tsv are guaranteed; path includes retriever segment

Co-Authored-By: Claude Opus 4.7 <[email protected]>
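
The /models index change in this commit amounts to a small fallback rule: use an explicit display label when one exists, otherwise strip the provider prefix from the id. A minimal sketch, assuming a hypothetical ModelEntry shape with an optional label field; the actual registry format is not shown in this PR.

```typescript
// Hypothetical model-registry entry; the real registry format may differ.
interface ModelEntry {
  id: string;     // provider-prefixed id, e.g. "openai/gpt-4.1"
  label?: string; // optional explicit display label
}

// Prefer an explicit label; otherwise show the id without its provider prefix.
function modelDisplayLabel(entry: ModelEntry): string {
  if (entry.label) return entry.label;
  const slash = entry.id.lastIndexOf("/");
  return slash === -1 ? entry.id : entry.id.slice(slash + 1);
}

console.log(modelDisplayLabel({ id: "openai/gpt-4.1" })); // prints "gpt-4.1"
```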

…ed filter card

- Wrap every table in a fixed-height card with a styled 8px thin scrollbar so the page chrome stays in view while rows scroll
- Sticky thead inside the scroll container; sticky leftmost axis columns (Method/Model/Retriever, varies per page) with CSS-var-driven widths and a mobile fallback
- Inline sort arrows on stacked dataset/metric column headers via a slot the table wires into
- Filter chips moved into a dedicated card; metric toggle now also re-fires the current sort so row order matches the visible metric
- MatrixCell always renders both metric spans (em-dash for missing) and uses the new .qg-cell-best highlight (accent + dark-mode glow)
- Decimal precision unified at 4 across MatrixCell, side-by-side dataset cells, and the run-detail metrics table
- /datasets/[id] renders both metrics side by side instead of a single-column toggle
- /datasets/ index drops the stale eval_metrics badge
- /runs/[run_id] reproduce snippet simplifies the qrels lookup
- Stat cards gain hover:border-qg-accent; InteractiveTable search input restyled with magnifier icon; MetricCell removed (dead code)

Co-Authored-By: Claude Opus 4.7 <[email protected]>
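
The MatrixCell behavior in this commit (both metric spans always present, four-decimal precision, a placeholder for missing values, and a best-value highlight) can be sketched as two pure helpers. This is a TypeScript sketch under assumptions: CellValues, formatCell, and bestRows are hypothetical names, and letting tied values all receive .qg-cell-best is a guess rather than documented behavior.

```typescript
// Hypothetical cell model: primary and secondary metric, either may be missing.
interface CellValues {
  primary?: number;   // e.g. nDCG@10
  secondary?: number; // e.g. R@1k or R@100
}

const PRECISION = 4;      // unified decimal precision
const MISSING = "\u2014"; // em dash shown for a missing metric

// Always return text for both spans so every cell has the same structure.
function formatCell(cell: CellValues): { primary: string; secondary: string } {
  const fmt = (v?: number): string =>
    v === undefined || !Number.isFinite(v) ? MISSING : v.toFixed(PRECISION);
  return { primary: fmt(cell.primary), secondary: fmt(cell.secondary) };
}

// Indices of the rows holding the column maximum; these would get .qg-cell-best.
// Missing values never win, and ties all win (an assumption).
function bestRows(column: Array<number | undefined>): Set<number> {
  let best = -Infinity;
  for (const v of column) {
    if (v !== undefined && v > best) best = v;
  }
  const winners = new Set<number>();
  column.forEach((v, i) => {
    if (v !== undefined && v === best) winners.add(i);
  });
  return winners;
}
```

For example, formatCell({ primary: 0.5 }) yields "0.5000" for the primary span and the em dash placeholder for the secondary span, and bestRows([0.3, undefined, 0.7, 0.7]) marks rows 2 and 3.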

radinhamidi · 2026-05-20T18:37:09Z

Pushed a follow-up commit (2f6b74f) addressing the table-mechanics critique:

Sticky thead + sticky leftmost axis columns (Method/Model/Retriever per page) with themed thin scrollbar inside a fixed-height table card.
Filter chips wrapped in a dedicated card; metric toggle now re-fires the current sort so rows match the visible metric.
MatrixCell always renders both metric spans; precision unified at 4 digits everywhere; new .qg-cell-best highlight (accent + dark-mode glow).
/datasets/[id] renders both metrics side by side instead of a one-column toggle.
/datasets/ index drops the stale metrics: badge; /runs/[run_id] reproduce snippet simplifies the qrels lookup.
Sort arrows on stacked dataset/metric headers now sit inline with the dataset name (slot-based, so they no longer drop to a new line).
Stat cards gain hover-border polish; deleted dead MetricCell.astro.
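
A minimal sketch of the filter-then-sort behavior described in this comment: chips hide rows by facet value, and toggling the metric re-applies the current sort so row order tracks the visible metric. The TableRow and TableState shapes and the visibleRows, sortRows, and toggleMetric helpers are hypothetical; the real pages wire this through DOM classes and events (qg-chip-hidden, qg-itable-reapply) rather than pure functions.

```typescript
type Metric = "primary" | "secondary";

// Hypothetical row: chip-group facets plus per-dataset metric values.
interface TableRow {
  facets: Record<string, string>; // e.g. { Model: "gpt-4.1", Retriever: "BM25" }
  values: Record<string, { primary?: number; secondary?: number }>;
}

interface TableState {
  hidden: Record<string, Set<string>>; // chip group -> values toggled off
  metric: Metric;
  sort?: { column: string; dir: "asc" | "desc" };
}

// A row stays visible unless one of its facet values has been toggled off.
function visibleRows(rows: TableRow[], state: TableState): TableRow[] {
  return rows.filter((row) =>
    Object.entries(row.facets).every(([group, value]) => !state.hidden[group]?.has(value)),
  );
}

// Sort by the active metric in the chosen column; missing values go last.
function sortRows(rows: TableRow[], state: TableState): TableRow[] {
  const sort = state.sort;
  if (!sort) return rows;
  const sign = sort.dir === "asc" ? 1 : -1;
  const valueOf = (row: TableRow): number | undefined => row.values[sort.column]?.[state.metric];
  return [...rows].sort((a, b) => {
    const va = valueOf(a);
    const vb = valueOf(b);
    if (va === undefined && vb === undefined) return 0;
    if (va === undefined) return 1;
    if (vb === undefined) return -1;
    return sign * (va - vb);
  });
}

// Flipping the metric re-runs the current sort, so row order follows the
// metric that is now visible instead of the one shown before the toggle.
function toggleMetric(rows: TableRow[], state: TableState): { state: TableState; rows: TableRow[] } {
  const next: TableState = { ...state, metric: state.metric === "primary" ? "secondary" : "primary" };
  return { state: next, rows: sortRows(visibleRows(rows, next), next) };
}
```

With a descending sort on a dataset column, a row that leads on the primary metric but trails on the secondary one moves down as soon as the metric chip switches, which is the mismatch the follow-up commit addresses.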

radinhamidi and others added 2 commits May 20, 2026 03:17

radinhamidi merged commit e60fa48 into main May 20, 2026
2 checks passed

radinhamidi deleted the leaderboard/critical-table-revision branch May 20, 2026 18:39
